In these exercises, you will use some of the functions from the tidyr package to make datasets tidy.
In several of our exercises (incl. this one), we will use data on global life expectancy from Gapminder and the Titanic dataset from Kaggle. In addition, for one of the exercises on tidy data, we will use an excerpt from NationMaster data on murder and intentional homicide for 2010.
First of all, copy, paste, and rund the following code in(to) your R script to load/create the datasets we will use in these exercises.
library(tidyverse)
gap_life <- read_csv("../data/gapminder/life_expectancy_years.csv")
titanic <- read_csv("../data/titanic/titanic.csv")
crime <- tibble(country = rep(c("Germany", "Brazil", "Norway"), 2),
crime = c(rep("murders", 3), rep("intentional homicide rate", 3)),
year = 2010,
value = c(690, 40974, 29, 0.84, 27, 0.68))
You should gather the years into one column/variable. If you are unsure about the arguments of a function, you can always consult the help files by typing (and running) a ? directly followed by the function name (e.g., ?glimpse). NB: This only works if you have previously loaded the package that includes the function.
-variable_name as the second argument (or first if you use pipes) to the gather() function.
gap_life_long <- gap_life %>%
gather(-country, key = "year", value = "life_exp")
gap_life_long
## # A tibble: 40,953 x 3
## country year life_exp
## <chr> <chr> <dbl>
## 1 Afghanistan 1800 28.2
## 2 Albania 1800 35.4
## 3 Algeria 1800 28.8
## 4 Andorra 1800 NA
## 5 Angola 1800 27
## 6 Antigua and Barbuda 1800 33.5
## 7 Argentina 1800 33.2
## 8 Armenia 1800 34
## 9 Australia 1800 34
## 10 Austria 1800 34.4
## # ... with 40,943 more rows
spread() the crime variable.
crime_spread <- crime %>%
spread(key = "crime", value = "value")
crime_spread
## # A tibble: 3 x 4
## country year `intentional homicide rate` murders
## <chr> <dbl> <dbl> <dbl>
## 1 Brazil 2010 27 40974
## 2 Germany 2010 0.84 690
## 3 Norway 2010 0.68 29
titanic <- titanic %>%
separate(Name, c("last_name", "first_name"), sep =", ")
titanic
## # A tibble: 891 x 13
## PassengerId Survived Pclass last_name first_name Sex Age SibSp Parch
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 0 3 Braund Mr. Owen ~ male 22 1 0
## 2 2 1 1 Cumings Mrs. John~ fema~ 38 1 0
## 3 3 1 3 Heikkinen Miss. Lai~ fema~ 26 0 0
## 4 4 1 1 Futrelle Mrs. Jacq~ fema~ 35 1 0
## 5 5 0 3 Allen Mr. Willi~ male 35 0 0
## 6 6 0 3 Moran Mr. James male NA 0 0
## 7 7 0 1 McCarthy Mr. Timot~ male 54 0 0
## 8 8 0 3 Palsson Master. G~ male 2 3 1
## 9 9 1 3 Johnson Mrs. Osca~ fema~ 27 0 2
## 10 10 1 2 Nasser Mrs. Nich~ fema~ 14 1 0
## # ... with 881 more rows, and 4 more variables: Ticket <chr>, Fare <dbl>,
## # Cabin <chr>, Embarked <chr>